class: inverse, center, middle

Introduction to single cell sequencing



Overview

  • High-throughput Sequencing
  • Emergence of single-cell
  • Differing single-cell assays
  • Differing single-cell technologies
  • Overview of Single-cell sequencing

Sequencing

The emergence of high-throughput sequencing technologies such as 454 (Roche) and Solexa (Illumina) sequencing allowed for the highly parallel short read sequencing of DNA molecules.

  • 454 (Roche) ~ 2005
  • Solexa (Illumina) ~ 2006

Sequencing overview

overview

Sequence files - FastQ

FastQ (FASTA with Qualities)

igv

  • “@” followed by identifier.
  • Sequence information.
  • “+”
  • Quality scores encodes as ASCI.

Sequence files - FastQ

FastQ - Header

igv

  • Header for each read can contain additional information
    • HS2000-887_89 - Machine name.
    • 5 - Flowcell lane.
    • /1 - Read 1 or 2 of pair (here read 1)

Sequence files - FastQ

FastQ - Qualities

igv

  • Qualities follow “+” line.
  • -log10 probability of sequence base being wrong.
  • Encoded in ASCI to save space.
  • Used in quality assessment and downstream analysis

Sequencing use cases

  • ChIP-seq - Study transcription factor binding
  • RNA-seq - Study genome wide transcriptomics
  • Whole Genome/Exome Sequencing - Study genome variation/changes
  • MNAseq/DNAse-seq/ATAC-seq - Study chromatin structure/accessibility
  • CrispR-seq - Study genome wide phenotypes
  • And many more..

Bulk sequencing methods

Sequencing typically performed on bulk tissue or cells.

  • Tissues - Liver vs Heart
  • Conditions - Cancer vs surrounding normal tissue
  • Treatments - Hep-G2 cells with/without rapamycin
  • Genetic perturbation - Ikzf WT/KO in B-cells.

Analysis of the bulk characteristics of data without understanding of hetergeneity of data.

FACS Cell and Nuclei sorting

Newer technologies such as TRAP from the Heintz lab or nuclei sorting allow for capture of distinct cell types based on expressed markers.

Pros - Allow for the capture of rare cell populations such as specific neuron types.

Cons - Require known markers for desired cell populations.

Single-cell sequencing

With the advent of advanced microfluidics and refined sequencing technologies, single-cell sequencing has emerged as a technology to profile individual cells from a heterogeneous population without prior knowledge of cell populations.

Pros - No prior knowledge of cell populations required. - Simultaneously assess profiles of 1000s of cells.

Cons - Low sequencing sequencing depth for individual cells (1000s vs millions of reads for bulk).

Single-cell use cases

Single-cell sequencing, as with bulk sequencing, has now been applied to the study of a wide range of differing assays.

  • scRNA-seq/snRNA-seq - Transcriptomics
  • scATAC-seq/snATAC-seq - Chromatin accessibility
  • CITE-seq - Surface protein expression
  • AIRR - TCR/BCR profiling
  • Spatial-seq - Spatial profiling of gene expression.

Single-cell platforms

  • 10x Genomics (microfluidics)
  • Smart-seq3 (microfluidics)
  • Drop-seq (microfluidics)
  • VASA-seq (FANS)
  • inDrop-seq (microfluidics)
  • In-house solutions

Many companies offer single-cell sequencing technologies which may be used with the Illumina sequencer.

Major Single-cell platforms

Two popular major companies offer the most used technologies.

  • 10x Genomics
  • Smart-seq3

Major difference between the two are the sequencing depth and coverage profiles across transcripts.

  • 10x is 3’ biased so will not provide full length transcript coverage.
  • As 10x is sequencing only the 3’ of transcripts/genes, less sequencing depth is required per transcript/gene

The 10x Genomics system

overview

The 10x Genomics system - Microfluidics

overview

The 10x Genomics system - Gel beads

overview

The 10x Genomics system - Sequencing primers

overview

The 10x Genomics system - Sequencing R1

Read 1

The 10x Genomics system - Sequencing R2

Read 2

The 10x sequence reads

The sequence reads contain:-

  • Information on cell identity (10x barcode)
  • Information on molecule identity (UMI)
  • Information on sample identity (Index read)
  • Information on the RNA molecule (transcriptome read)

From Sequences to Processed data

As with standard bulk sequencing data, the next steps are typically to align the data to a reference genome/transcriptome and summarize data to a signal matrix.

  • Align reads to transcriptome and summarize signal to genes for RNAseq
  • Align reads to genome, call enriched windows/peaks and summarize signal in peaks.

Processing scRNA-seq/snRNA-seq data.

For the processing of scRNA/snRNA from fastQ to count matrix, there are many options available to us.

Alignment and counting - Cellranger count - STAR - STARsolo - Subread cellCounts

Pseudoalignment and counting - Salmon - Alevin - Kallisto - Bustools

Droplets to counts

The output of these tools is typically a matrix of the signal attributed to cells and genes (typically read counts).

This matrix is the input for all downstream post-processing, quality control, normalization, batch correction, clustering, dimension reduction and differential expression analysis.

The output matrix is often stored in a compressed format such as:- - MEX (Market Exchange Format) - HDF5 (Hierarchical Data Format)

]

overview ]

HDF5 format

  • Optimized format for big data.
  • Contains matrix information alongside column and row information.
  • Loom format version of HDF5 most popular.
  • Efficient to work with programmatically.

MEX format

  • Plain/Compressed text format
  • Contains matrix information in tsv file
  • Separate row and column files contain barcode (cell) and feature (gene/transcript) information.

This is an example of a directory produced by Cell Ranger.

Cell Ranger

Cell Ranger is the typical approach we use to process 10x data. The default setting are pretty good. This is an intensive program, so we will not be running this locally on your laptops. Instead we run it on remote systems , like the HPC.

If you are working with your own data, the data will often be provided as the Cell Ranger output by the Genomics/Bioinformatics teams, like here at Rockefeller University.